library(readr)
library(dplyr)
library(ggplot2)
library(purrr)
library(stringr)
library(cowplot)
metrics <- read_csv("../metrics/combined_report_metrics.csv")
Computer vision models are increasingly used in pathology, often achieving performances comparable to expert clinicians. However, these models are typically trained on high-quality images that are costly to produce, while real-world clinical data is often lower in quality. This study evaluates model performance under varying levels of image degradation to identify architectures that are more robust to realistic, lower quality inputs and inform clinical expectations and decisions in practice.
The impact of focus-related artefacts in Whole Slide Imaging (WSI)—specifically blur and noise—on the classification of breast tissue cells was the focus of this experiment. To simulate the low-quality test data often encountered in clinical settings, Gaussian blur and noise were applied to images during testing. Four models were evaluated: ResNet50 (pretrained on ImageNet), a custom CNN, Random Forest, and XGBoost. Deep learning models used normalised inputs to learn local features, while machine learning models applied PCA to capture global patterns.
All models lost performance as images degraded, but the machine learning models were the most robust. They showed negligible drops with noise and only ~10% accuracy loss at extreme blur levels. Notably, both consistently detected >80% of tumours and kept relatively high precision for immune cells, showing potential as screening tools even if overall accuracy drops. The deep learning models, while initially the most accurate, deteriorated rapidly with augmentation, often defaulting to one or two classes.
ResNet50 remained optimal on high-quality data (≥70% accuracy), but XGBoost demonstrated the best performance under degraded conditions, with stable accuracy and high recall for tumour cells. These results highlight the need to match model choice to the expected quality of clinical input data.
To support clinical interpretation, a Shiny application was developed to visualise the effects of image degradation and model performance at each augmentation level, both overall and by class. The app also enables pathologists to upload their own images and observe model predictions across varying levels of degradation, providing insights into prediction stability, and the most appropriate model for their image quality.
The experiment was designed for full reproducibility. All code and implementation instructions are available at https://github.com/AlanS812/data3888-14. Figure 1 outlines the experimental workflow, with corresponding Python scripts indicated for clarity and ease of use.
The medical field is increasingly turning to computer vision models for classification tasks, often to distinguish cell types. With rapid progress in the field, models have reached accuracies of up to 98.5% [1], slightly above that of an experienced human pathologist [2]. The question becomes: can these models maintain their performance in day-to-day practice? In real life, cell images do not always have perfect quality. Investigating how drops in image quality affect classification performance will give medical imagers a better view of how these models will perform on real data.
To address this issue, it is first necessary to determine the common causes and kinds of image degradation. These vary greatly depending on the situation; for example, motion blur is a typical challenge for MRIs [3]. This report, however, centres on histological H&E stained tissue slides, specifically of breast cancer tissue [4]. The major cause of quality issues in this case is Whole Slide Imaging (WSI), which introduces blur and noise [5].
WSI is the common practice of scanning full microscope slides, rather than scanning section by section [5]. It has reduced the processing time of a single slide to mere minutes, but there are trade-offs in image quality. Focal points are the points where the camera centres its focus; they are selected automatically or manually. As microscope slides are three-dimensional, if a focal point is selected on a region with a different depth to the typical focus depth of the slide, its neighbouring areas will be out of focus. This leads to blur, and to more prominent noise, because the camera fails to fully capture high-frequency information such as texture and edges [6].
There are ways to mitigate drops in image quality: increasing the number of focal points reduces the prominence of the issue, but does not eradicate it entirely. It also increases processing time, inconveniencing patients and adding delays to overloaded labs [5]. Imaging technology is advancing quickly, but given its prohibitive expense, the issue is likely to persist. Therefore, giving clinicians information on how degradation will affect their models, and on which architectures are better suited to lower image quality, is crucial to the success of medical image classification.
Data was sourced from the Gene Expression Omnibus, a public repository of biomedical data. A Xenium Analyser was used to produce high-resolution whole-slide images of breast tissue with a tumour present [4], an alternative to the lower quality images typically generated by WSI. Individual cells were identified, cropped to an image, and labelled [7]. Although cell images with pixel borders of both 50 and 100 were provided, only the 100-pixel set was used, both for its higher historical performance and to respect computational constraints given the high volume of raw data provided [8]. All images originated from the same selected slide.
Images were grouped into classes based on their role in breast cancer progression and diagnosis. Tumour cells were grouped for their direct pathological relevance, immune cells for their similar functional responses to disease, and stromal cells for their structural role in the tumour environment. The “other” group included cells that didn’t clearly fit the main categories but were retained to help the model learn to distinguish diagnostically important cells from potentially less relevant ones. [25][26]
Each class was randomly and equally sampled to ensure the model learns from all biologically important categories, not just those most prevalent in the provided tissue. In real tissue, critical cells like tumour or immune types may be rare, so underrepresenting them could potentially weaken the model’s diagnostic ability. Unlabelled cell images were excluded to ensure consistent truth labels and avoid introducing noise into the training process.
To balance model performance with computational constraints, 20,000 of the ~175,000 available images were used, with 75% for training, 10% for validation, and 15% for testing. Although a typical data split allocates 80% for training, 10% for validation, and 10% for testing, the test share was increased to enhance the reliability of the final evaluation metrics. Three separate test sets were used, with metrics averaged across them to assess model stability. This approach provides greater insight into variability across test sets, offering medical professionals a clearer understanding of model reliability in practice. Due to natural variation in cell size, cropped images were not uniform, and were therefore resized to 224×224 pixels using Lanczos downsampling (Duchon 1979) [9] to meet ResNet50’s input requirements.
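The equal per-class sampling and the 75/10/15 split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's script: the function name `balanced_split`, the toy label list and the per-class count are hypothetical, and the source does not say how the three test sets were constructed, so only a single test subset is produced here.

```python
import random
from collections import defaultdict

def balanced_split(labels, per_class, fractions=(0.75, 0.10, 0.15), seed=0):
    """Sample `per_class` indices from each class, then split each class
    into train/validation/test by `fractions` so every subset stays balanced."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    splits = {"train": [], "val": [], "test": []}
    for label, indices in sorted(by_class.items()):
        chosen = rng.sample(indices, per_class)  # equal, random sample per class
        n_train = round(fractions[0] * per_class)
        n_val = round(fractions[1] * per_class)
        splits["train"] += chosen[:n_train]
        splits["val"] += chosen[n_train:n_train + n_val]
        splits["test"] += chosen[n_train + n_val:]  # remainder goes to test
    return splits

# Toy stand-in for the labelled cell images: a deliberately imbalanced label list.
toy_labels = ["tumour"] * 900 + ["immune"] * 400 + ["stromal"] * 700 + ["other"] * 300
splits = balanced_split(toy_labels, per_class=200)
print({name: len(idx) for name, idx in splits.items()})
```

Splitting within each class, rather than over the pooled sample, keeps the class balance identical in the training, validation and test subsets.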
To assess the impact of reduced image quality on classification performance, two classical machine learning models and two deep learning models were trained, to compare their robustness and determine whether either family had an advantage over the other.
The models were chosen for their compatibility with high-dimensional medical image data. A custom CNN and an ImageNet-pretrained ResNet50 were used for their ability to extract spatial features directly from pixel data, while Random Forest and XGBoost were selected for their robustness to high-dimensional, potentially redundant features, particularly when HOG or PCA is used to make the feature space more compact and informative. All four have had particular success in medical image classification in previous studies [21][22][23][24]. Models such as SVM and k-NN were avoided due to their difficulty with increased data scale and their sensitivity to the high-dimensional nature of image-based classification tasks.
Images were normalised with ImageNet parameters for both deep learning models to ensure comparability between results. For the machine learning models, three input representations were tested: raw pixels, Histogram of Oriented Gradients (HOG) and Principal Component Analysis (PCA) [10]. Raw pixels were too computationally expensive to scale, and HOG gave poor results, likely due to the loss of intensity and colour information, which is vital in stained images.
PCA gave the best performance while minimising space, reducing approximately 50,000 pixels to 100 components that account for ~60% of the variance in the dataset. Images were flattened, and the principal components (PCs) were fit to the training dataset only. Test images were then linearly transformed with these pre-fit PCs rather than refit, to avoid data leakage. Each principal component can be visualised as a 224×224 image capturing some global feature, and the transformed images give insight into how the model decides, improving interpretability, which is vital for high-stakes medical decisions. [15]
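The fit-on-train, transform-on-test pattern described above can be sketched with NumPy alone. This is an illustrative sketch rather than the study's implementation: the helper names `fit_pca` and `apply_pca` are hypothetical, and small random arrays stand in for the flattened 224×224 images and the 100 components.

```python
import numpy as np

def fit_pca(train, n_components):
    """Fit PCA on flattened training images (rows = images)."""
    mean = train.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, singular_values, vt = np.linalg.svd(train - mean, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return mean, vt[:n_components], explained[:n_components]

def apply_pca(images, mean, components):
    """Project images using the training mean and components (no refit)."""
    return (images - mean) @ components.T

rng = np.random.default_rng(0)
train = rng.normal(size=(60, 8 * 8))  # toy stand-in for flattened training images
test = rng.normal(size=(15, 8 * 8))   # toy stand-in for flattened test images

mean, components, explained = fit_pca(train, n_components=10)
train_features = apply_pca(train, mean, components)
test_features = apply_pca(test, mean, components)  # reuses the training fit
print(train_features.shape, test_features.shape, round(float(explained.sum()), 3))
```

Because the test images only ever pass through `apply_pca`, no statistic of the test set (not even its mean) influences the features, which is the leakage the text warns against.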
Models were tested on the same testing set at different levels and combinations of Gaussian blur and noise to simulate WSI damage [6]. Initially both were tested at small increments, which were widened once a pattern emerged to manage computation. Blur was tested at kernel sizes from 0 to 19, and noise at levels from 0 to 30, providing a full range from high-quality to completely degraded images.
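A test-time degradation step of this kind might look like the sketch below. Several details are assumptions, since the source does not give them: the function name `degrade` is hypothetical, the noise level is treated as the standard deviation of additive Gaussian noise on a 0 to 255 scale, and the kernel-size-to-sigma rule is the one OpenCV documents for an unspecified sigma, which may not be what the study used. Kernel sizes are assumed odd, with 0 meaning no blur.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """1-D normalised Gaussian kernel of odd length `size`."""
    offsets = np.arange(size) - (size - 1) / 2
    kernel = np.exp(-(offsets**2) / (2 * sigma**2))
    return kernel / kernel.sum()

def degrade(image, blur_size=0, noise_std=0.0, rng=None):
    """Blur, then add noise to, an H x W x C image with values in [0, 255]."""
    out = image.astype(float)
    if blur_size > 1:
        # Assumed size-to-sigma rule (OpenCV's default when sigma is not given).
        sigma = 0.3 * ((blur_size - 1) * 0.5 - 1) + 0.8
        kernel = gaussian_kernel(blur_size, sigma)
        pad = blur_size // 2
        padded = np.pad(out, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
        # Separable blur: filter along image rows, then along image columns.
        padded = np.apply_along_axis(np.convolve, 0, padded, kernel, mode="valid")
        out = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    if noise_std > 0:
        if rng is None:
            rng = np.random.default_rng()
        out = out + rng.normal(0.0, noise_std, size=out.shape)
    return np.clip(out, 0, 255)

rng = np.random.default_rng(0)
image = rng.uniform(0, 255, size=(32, 32, 3))  # toy stand-in for a cell image
blurred = degrade(image, blur_size=9)
noisy = degrade(image, noise_std=10, rng=rng)
print(blurred.shape, noisy.shape, bool(blurred.std() < image.std()))
```

Applying the same function over a grid of `blur_size` and `noise_std` values to one fixed test set gives the kind of augmentation grid evaluated in this report.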
To evaluate overall performance, accuracy, confusion matrices, average maximum confidence, and weighted F1, precision and recall scores were recorded. For a per-class breakdown, precision, recall, F1, the mean and standard deviation of confidence, and the prediction count were recorded.
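As a reminder of how these scores relate to a confusion matrix, the sketch below derives per-class and support-weighted metrics from a small made-up matrix. The counts are illustrative only and are not results from this study; scoring an undefined precision or recall as 0 is an assumed convention, and the function names are hypothetical.

```python
def per_class_metrics(cm):
    """Precision, recall, F1 and support per class from a square confusion
    matrix whose rows are true classes and columns are predicted classes."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        predicted = sum(cm[i][k] for i in range(n))  # column sum
        actual = sum(cm[k])                          # row sum (support)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append({"precision": precision, "recall": recall,
                        "f1": f1, "support": actual})
    return results

def weighted_average(results, key):
    """Average a per-class metric, weighting each class by its support."""
    total = sum(r["support"] for r in results)
    return sum(r[key] * r["support"] for r in results) / total

def accuracy(cm):
    return sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

# Made-up 4-class counts (rows: true immune, other, stromal, tumour).
cm = [[40, 2, 6, 2],
      [10, 15, 15, 10],
      [5, 3, 38, 4],
      [2, 1, 5, 42]]
results = per_class_metrics(cm)
print("accuracy:", accuracy(cm))
for name in ("precision", "recall", "f1"):
    print("weighted", name, round(weighted_average(results, name), 3))
```

One property worth noting: support-weighted recall always equals overall accuracy, so of the weighted scores it is precision and F1 that carry additional information.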
# Average accuracy across the test sets for each model and augmentation level
blur_plot_data <- metrics %>%
  select(test_set, blur_size, noise_level, accuracy, Model_Label) %>%
  filter(Model_Label %in% c("RF (PCA)", "XGBoost (PCA)", "CNN", "ResNet")) %>%
  group_by(blur_size, noise_level, Model_Label) %>%
  summarise(accuracy = mean(accuracy), .groups = "drop") %>%
  mutate(
    noise_level = as.factor(noise_level),
    Model_Label = factor(Model_Label)
  )
# One panel per model: accuracy against blur, one line per noise level
print(ggplot(blur_plot_data, aes(x = blur_size, y = accuracy, color = noise_level, group = noise_level)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  facet_wrap(~ Model_Label, ncol = 2) +
  labs(
    x = "Blur Kernel Size",
    y = "Accuracy",
    color = "Noise Level"
  ) +
  scale_color_brewer(palette = "Blues") +
  theme_minimal(base_size = 13) +
  theme(
    strip.text = element_text(face = "bold", size = 12),
    legend.position = "right"
  ))
As shown in Figure 2, the deep learning models perform best on unaugmented images, but their performance drops sharply with any augmentation on every metric except precision. In contrast, the machine learning models maintain relatively stable accuracy, F1, and weighted recall across augmentation levels, and mostly stable precision.
Precision is the exception for the deep learning models: it stays stable up to high levels of blur, provided no noise is applied. Once noise is applied, only immune and stromal are predicted. When blur alone is applied, the models predict ‘other’ and ‘tumour’ extremely rarely, and the few such predictions are mostly correct, which artificially inflates the average precision.
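The inflation effect can be reproduced with a small worked example. The confusion matrix below is invented for illustration and is not taken from the study's results: the classifier predicts 'other' once and 'tumour' twice, all correctly, and otherwise falls back on immune and stromal.

```python
# Rows: true immune, other, stromal, tumour (50 cells each).
# Columns: predicted immune, other, stromal, tumour. Counts are invented.
cm = [[30, 0, 20, 0],
      [24, 1, 25, 0],
      [15, 0, 35, 0],
      [23, 0, 25, 2]]
classes = ["immune", "other", "stromal", "tumour"]
n = len(cm)

precision = []
recall = []
for k in range(n):
    predicted = sum(cm[i][k] for i in range(n))  # how often class k was predicted
    precision.append(cm[k][k] / predicted)
    recall.append(cm[k][k] / sum(cm[k]))

accuracy = sum(cm[k][k] for k in range(n)) / sum(map(sum, cm))
mean_precision = sum(precision) / n  # equal supports, so weighted = unweighted

for name, p, r in zip(classes, precision, recall):
    print(f"{name:8s} precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.2f} mean precision={mean_precision:.2f}")
```

Here 'other' and 'tumour' each score a precision of 1.0 from only one or two predictions while their recall is close to zero, so the averaged precision ends up roughly double the accuracy. Reading precision alongside per-class recall and prediction counts avoids this trap.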
Noise causes the sharpest drop in accuracy for the CNN and ResNet50 models, with ResNet50 dropping to 30% once noise reaches a standard deviation of 1. The CNN tolerates noise up to level 3 before dropping sharply. Both models are highly effective at extracting spatial features, relating the information of several neighbouring pixels rather than treating values individually. Because these models depend on such close pixel relationships, it makes sense that slight changes to each pixel’s colour, and to all of its neighbours, would have a significant impact on performance. Mayer et al. (2022) [11] similarly found that denoising software improved CNN performance.
Blur also caused a rapid performance drop, especially beyond kernel sizes of 9 pixels, though less severe than noise. This result is not surprising: even to the human eye, this level of blurring is significant. The effect is consistent with that of Gaussian noise but less extreme, because blur does not perturb each pixel with an independent random draw; it replaces each pixel with a weighted average of its neighbours. As a result, low levels of blur cause only minor performance decreases.
Several other studies report similar findings, showing that models trained on sharp images struggle to generalise to blurred ones. Interestingly, Jang & Tong (2021) [12] trained CNNs first on blurred images and later on sharp images, and found that these models consistently performed better than those trained only on sharp images (as was done in this investigation).
In contrast, although they start at lower accuracies, the machine learning models are more robust to drops in quality. Both show limited drops up to a blur kernel size of 5 and hardly any impact from noise levels of up to 30.
The robustness of these models is most likely due to the principal component dimension reduction. This makes intuitive sense once visualised: Figure 3 shows that after the principal component transformation has been applied, the images appear blurred and effectively lose their noise. As the principal components (PCs) are global, capturing the whole image, noise is largely discarded. By comparison, a CNN-based model examines local patterns, which can be distorted by localised noise. Notably, PCA is often used as a denoising technique [13], which helps explain its robustness to noise in classification tasks.
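The denoising behaviour can be illustrated on synthetic data. This is a toy sketch, not an analysis of the study's images: the "images" are mixtures of a few smooth global patterns, the noise level is arbitrary, and the helper `project_and_reconstruct` is hypothetical. Components are fit on clean data only, mirroring the train-only fit used in the report.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_patterns = 400, 5

# Toy "clean images": random mixtures of a few smooth, global patterns.
grid = np.linspace(0, 1, n_pixels)
patterns = np.stack([np.sin((k + 1) * np.pi * grid) for k in range(n_patterns)])
clean_train = rng.normal(size=(200, n_patterns)) @ patterns
clean_test = rng.normal(size=(50, n_patterns)) @ patterns

# Fit PCA on the clean training images only.
mean = clean_train.mean(axis=0)
_, _, vt = np.linalg.svd(clean_train - mean, full_matrices=False)
components = vt[:n_patterns]

def project_and_reconstruct(images):
    """Project onto the retained components, then map back to pixel space."""
    scores = (images - mean) @ components.T
    return scores @ components + mean

# Add independent per-pixel Gaussian noise to the test images.
noisy_test = clean_test + rng.normal(scale=0.5, size=clean_test.shape)
error_before = np.mean((noisy_test - clean_test) ** 2)
error_after = np.mean((project_and_reconstruct(noisy_test) - clean_test) ** 2)
print(f"MSE to clean image before projection: {error_before:.4f}, after: {error_after:.4f}")
```

Independent pixel noise spreads its energy evenly over all 400 dimensions, so keeping only 5 components retains roughly 5/400 of it while preserving the global structure. The same reasoning suggests why features built from 100 components of a ~50,000-pixel image change little under per-pixel noise.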
The machine learning models are slightly less robust to blur. As blur increases, the linear transformation can recover less information. However, even at extreme levels where the image is uninterpretable to the human eye, the model can still extract colour, spatial and intensity information. Again, because PCA is global, it captures larger-scale features and can continue extracting relevant patterns even with limited local information.
Thus, XGBoost with PCA is best for low-quality data; ResNet50 performs best on high-quality inputs.
# need to fix caption here
# parse a confusion matrix string into a 4x4 numeric matrix (filled row by row)
parse_cm <- function(cm_str) {
  nums <- as.numeric(unlist(str_extract_all(cm_str, "\\d+")))
  matrix(nums, nrow = 4, byrow = TRUE)
}
# average the confusion matrices across test sets
get_avg_cm <- function(df, model_name, blur_val, noise_val) {
  cms <- df %>%
    filter(Model_Label == model_name, blur_size == blur_val, noise_level == noise_val) %>%
    pull(confusion_matrix) %>%
    map(parse_cm)
  # divide by the number of matrices found (3 test sets in this study)
  reduce(cms, `+`) / length(cms)
}
# draw one averaged confusion matrix as a labelled heatmap
plot_cm <- function(cm, title) {
  df <- as.data.frame(as.table(cm))
  colnames(df) <- c("True", "Predicted", "Freq")
  df$True <- factor(as.integer(df$True),
                    levels = 1:4,
                    labels = c("Immune", "Other", "Stromal", "Tumour"))
  df$Predicted <- factor(as.integer(df$Predicted),
                         levels = 1:4,
                         labels = c("Immune", "Other", "Stromal", "Tumour"))
  ggplot(df, aes(x = Predicted, y = True, fill = Freq)) +
    geom_tile(color = "white") +
    geom_text(aes(label = round(Freq, 1)), size = 4) +
    scale_fill_gradient(low = "white", high = "dodgerblue") +
    coord_fixed() +
    labs(title = title) +
    theme_minimal(base_size = 12) +
    theme(
      axis.title = element_blank(),
      legend.position = "none",
      plot.title = element_text(hjust = 0.5,
                                size = 12,
                                face = "bold",
                                margin = margin(b = 5)),
      plot.margin = margin(t = 5, r = 5, b = 5, l = 5),
      # enlarge & rotate x-labels so they don't collide
      axis.text.x = element_text(
        size = 10,
        angle = 45,
        hjust = 1,
        vjust = 1,
        margin = margin(t = 5)
      ),
      # give y-labels a bit more breathing room
      axis.text.y = element_text(
        size = 10,
        margin = margin(r = 5)
      )
    )
}
# Generate plots for XGBoost and ResNet at no and maximum augmentation
p1 <- plot_cm(get_avg_cm(metrics, "XGBoost (PCA)", 0, 0), "XGBoost – No Augmentation")
p2 <- plot_cm(get_avg_cm(metrics, "XGBoost (PCA)", 19, 30), "XGBoost – Max Augmentation")
p3 <- plot_cm(get_avg_cm(metrics, "ResNet", 0, 0), "ResNet – No Augmentation")
p4 <- plot_cm(get_avg_cm(metrics, "ResNet", 19, 30), "ResNet – Max Augmentation")
# 2x2 grid: models in columns, augmentation levels in rows
combined1 <- plot_grid(
  p1, p3, p2, p4,
  nrow = 2,
  align = "hv",
  axis = "tblr"
)
# axis labels
labeled1 <- add_sub(combined1, "Predicted Class", vpadding = grid::unit(1, "lines"))
labeled1 <- ggdraw(labeled1) +
  draw_label("True Class", angle = 90, x = 0, y = 0.5, vjust = 1.5)
labeled1
As Figure 4 shows, following image augmentation ResNet50 predominantly predicted the immune and stromal classes, while the CNN defaulted almost exclusively to stromal. The machine learning models initially achieved high precision for the immune class (~80%), which declined to ~60% under full augmentation, and they tended to predict stromal and tumour. The ‘other’ class was rarely predicted by any model, likely because it is defined by function rather than by consistent visual features.
The application is designed to complement the real-world workflows of medical imaging professionals and diagnosticians, where visual assessment is central. By pairing augmented images with model performance metrics, it bridges the gap between human judgement and machine classification, supporting interdisciplinary decision-making.
The backend integrates two key pipelines: a pre-computed (but dynamically updatable) metrics pipeline, and pre-trained models that allow new predictions from user-uploaded images.
Page 1 displays example cell images from each class under user-selected blur and noise levels, alongside graphical performance summaries to allow quick model comparison under the augmentations applied.
Page 2 provides per-model class-level performance, including confidence and confusion matrices, supporting evaluation of model reliability and cell-type-specific performance as required.
Page 3 allows users to upload an image and view predictions across all augmentation combinations, helping assess model behaviour under novel, real-world image conditions.
Blur Simulation: Blur was uniformly applied across images, unlike real-world cases where focus artefacts vary spatially. As image degradation in practice is more complex than simple augmentations, our simulation may not fully reflect real-world conditions. Future work could implement spatially localised or random blur for more realistic degradation.
Data Size: Due to computational limits, only a subset of data was used. Scaling to the full dataset may enhance accuracy and robustness, especially for deep learning models.
Sampling Strategy: Although class labels were balanced, cell subtypes were not, potentially introducing bias. More granular sampling could improve representation.
Domain Generalisability: The study focused on breast tissue. Results may not generalise to other tissues or modalities with different degradation patterns. Broader testing is needed to assess robustness.
PCA and Deep Learning Integration: PCA boosted robustness in machine learning models. Incorporating PCA-reconstructed inputs into deep learning may balance CNN accuracy with global feature stability.
Model Development: Augmentations were test-time only. Training with them could improve resilience. Techniques like denoising or deblurring might also help counter degradation effects.
Accurate identification of cell types in histological breast tissue is essential for effective, early cancer diagnosis and treatment planning. However, real-world slides often suffer from quality issues, such as the blur and noise introduced by WSI, an important consideration when the goal is to detect critical classes like tumour cells.
This study evaluated four classification models across varying levels of image degradation, simulated with blur and noise. They were tested on a four-class problem, distinguishing breast tissue cells based on their function in cancer. Image quality had a considerable effect on the performance of both machine learning and deep learning models. While ResNet50 had the best performance on unaugmented images, the PCA-based machine learning models achieved stable and predictable performance even under severe image quality degradation. This suggests that simpler machine learning models have potential in settings where histological images are of low quality but correct cell type identification remains important.
This exploration underscores the importance of aligning the choice of model with the expected image quality in real-world workflows of pathologists and histology-based diagnosticians. To support this, an interactive Shiny application was developed to visualise how models react to different levels of image quality, so that medical professionals can make an informed choice of model based on their image quality and diagnostic priorities.